This document is an online supplement to
Evert, Stefanie (2022). Measuring Keyness. In: Proceedings of the Digital Humanities 2022 Conference, online/Tokyo, Japan.
It provides a reference implementation of the LRC keyness measure proposed in this paper, as well as replication code for the visualisation and analysis.
A user-friendly implementation of LRC will be made available as part
of the keyness() function in the R package
corpora version 0.6 (to be released by August 2022).
[Update: now scheduled for July 2023]
Keyness measures compare the relative frequency \(\pi_1\) (= occurrence probability) of a candidate term \(w\) in a linguistic population A with its frequency \(\pi_2\) in a population B, based on independent random samples (= corpora) A and B. This computation is repeated for each candidate term \(w\in C\), then candidates are ranked according to their keyness scores and cutoff thresholds may be applied.
For each candidate term, the observed data consist of its frequencies \(f_1\) and \(f_2\) in the two samples, together with the sample sizes \(n_1\) and \(n_2\).
Recent studies (Evert et al. 2018, Egbert & Biber 2019) suggest it is more robust to use document frequencies instead of term frequencies; in this case, \(n_1\) and \(n_2\) are the numbers of documents in the two samples.
NB: It is quite common to have some candidates with \(f_2 = 0\), i.e. terms that only occur in sample A, but not in sample B. In a symmetric keyword analysis (where A and B have equal status), the case \(f_1 = 0\) is also possible.
The observed candidate data can be collected in the form of a \(2\times 2\) contingency table, which represents a cross-classification of items and is the basis of many statistical techniques. The first column corresponds to sample A, the second column to sample B; the first row lists the frequency of the candidate term \(w\), the second row the total frequency of all other terms \(w'\).
\[ \begin{array}{|c|c|} \hline O_{11} = f_1 & O_{12} = f_2 \\ \hline O_{21} = n_1 - f_1 & O_{22} = n_2 - f_2 \\ \hline \end{array} \]
The columns of this contingency table form two independent binomial distributions with
\[ \mathbb{P}(f_1) = \binom{n_1}{f_1} \pi_1^{f_1} (1 - \pi_1)^{n_1 - f_1} \qquad \mathbb{P}(f_2) = \binom{n_2}{f_2} \pi_2^{f_2} (1 - \pi_2)^{n_2 - f_2} \]
that can be approximated very well by Poisson distributions
\[ \mathbb{P}(f_1) = \frac{e^{-\lambda_1}\cdot \lambda_1^{f_1}}{f_1 !}, \;\; \lambda_1 = n_1 \pi_1 \qquad \mathbb{P}(f_2) = \frac{e^{-\lambda_2}\cdot\lambda_2^{f_2}}{f_2 !}, \;\; \lambda_2 = n_2 \pi_2 \]
for the typical use case of lexical frequencies (\(\pi_1, \pi_2 \ll 1\)).
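This approximation is easy to verify with base R; the sample size and occurrence probability below are invented illustration values:

```r
## compare binomial sampling distribution with its Poisson approximation
n1 <- 1e6    # sample size in tokens (illustration value)
pi1 <- 1e-5  # occurrence probability of the candidate term
f1 <- 0:30   # range of observed frequencies
p.binom <- dbinom(f1, size=n1, prob=pi1)
p.pois <- dpois(f1, lambda=n1 * pi1)
max(abs(p.binom - p.pois))  # discrepancy is negligible
```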
Perhaps the most widely used keyness measures are based on statistical hypothesis tests and assess the statistical significance of each candidate term \(w\), i.e. the amount of evidence against the null hypothesis
\[ H_0: \pi_1 = \pi_2 \]
where there is no frequency difference between the two populations. We can compute the expected values of the contingency table under \(H_0\) (cf. Evert 2008: 1251f) as
\[ \begin{array}{|c|c|} \hline E_{11} = \frac{n_1 (f_1 + f_2)}{n_1 + n_2} & E_{12} = \frac{n_2 (f_1 + f_2)}{n_1 + n_2} \\ \hline E_{21} = n_1 - E_{11} & E_{22} = n_2 - E_{12} \\ \hline \end{array} \]
Most statistical tests for contingency tables are based on a comparison of the observed (\(O_{ij}\)) and expected (\(E_{ij}\)) contingency tables. Popular significance keyness measures are the chi-squared test (Scott 1997: 236)
\[ X^2 = \sum_{i=1}^2 \sum_{j=1}^2 \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
and in particular the log-likelihood measure (Rayson & Garside 2000: 3)
\[ G^2 = 2 \sum_{i=1}^2 \sum_{j=1}^2 O_{ij} \log \frac{O_{ij}}{E_{ij}} \]
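Both statistics can be computed directly from the observed and expected tables. As a sanity check with invented counts, the \(X^2\) value obtained in this way coincides with R's built-in chisq.test() (without continuity correction):

```r
## toy contingency table (invented counts)
f1 <- 20; f2 <- 5; n1 <- 1000; n2 <- 2000
O <- matrix(c(f1, n1 - f1, f2, n2 - f2), nrow=2)  # columns = samples A, B
E <- outer(rowSums(O), colSums(O)) / sum(O)       # expected cells under H0
X2 <- sum((O - E)^2 / E)
G2 <- 2 * sum(O * log(O / E))
all.equal(X2, unname(chisq.test(O, correct=FALSE)$statistic))  # TRUE
```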
The resulting scores indicate the amount of information against \(H_0\) provided by the two samples, i.e. higher values indicate stronger keyness. They can be translated into p-values, or alternatively cutoff thresholds corresponding to customary significance levels can be specified. It is common in keyword analysis to work with the raw keyness scores, though.
Both \(G^2\) and \(X^2\) are two-sided tests, i.e. they do not distinguish between the alternative hypotheses \(\pi_1 > \pi_2\) (which we’re interested in) and \(\pi_1 < \pi_2\) (sometimes referred to as negative keywords). For this reason, I prefer to assign a negative sign to \(G^2\) and \(X^2\) scores if \(f_1/n_1 < f_2/n_2\) (or equivalently \(O_{11} < E_{11}\)). NB: many other software packages do not make this adjustment, but may provide separate filters for positive and negative keywords.
Here I provide a reference implementation of
log-likelihood \(G^2\)
with negative scores for negative keywords. The function
G2() must be called with arguments \(f_1, f_2, n_1, n_2\), where \(f_1, f_2\) are vectors of the same length
(with corresponding entries for each candidate term \(w\)) and \(n_1,
n_2\) are usually scalar values specifying the fixed sample sizes
of A and B.
G2.term <- function (O, E) {
res <- O * log(O / E)
res[O == 0] <- 0
res
}
G2 <- function (f1, f2, N1, N2, alpha=NULL, correct=TRUE) {
stopifnot(length(f1) == length(f2))
## observed and expected contingency tables
N <- N1 + N2
R1 <- f1 + f2
O11 <- f1; E11 <- R1 * N1 / N
O12 <- f2; E12 <- R1 * N2 / N
O21 <- N1 - f1; E21 <- N1 - E11
O22 <- N2 - f2; E22 <- N2 - E12
## log-likelihood statistic (simplest formula)
G2 <- 2 * (G2.term(O11, E11) + G2.term(O12, E12) + G2.term(O21, E21) + G2.term(O22, E22))
res <- sign(O11 - E11) * G2 # set sign to distinguish positive vs. negative keywords
## weed out non-significant items if alpha is specified
if (!is.null(alpha)) {
if (correct) alpha <- alpha / length(f1)
theta <- qchisq(alpha, df=1, lower.tail=FALSE)
res[G2 < theta] <- 0 # set to 0 if not significant at level alpha
}
res
}
Significance measures can also be used to weed out candidates that are not statistically significant. Appropriate thresholds for a given significance level \(\alpha\) are obtained from the asymptotic \(\chi^2_1\) distribution of \(X^2\) and \(G^2\) under \(H_0\). If a significance level is specified with the alpha= argument, G2() will return a keyness score of 0 for any candidates that are not significant at level \(\alpha\).
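The corresponding cutoff thresholds for customary significance levels are easily obtained with qchisq():

```r
## critical values of the chi-squared distribution with 1 degree of freedom
alpha <- c(.05, .01, .001)
round(qchisq(alpha, df=1, lower.tail=FALSE), 3)  # 3.841  6.635 10.828
```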
Hardie (2014: 51-52) argues that such significance filters need to take multiple testing into account: an independent significance test is carried out for each candidate, and each of these tests can generate a false positive with probability \(\alpha\). Even a conservative significance level such as \(\alpha = .001\) might result in hundreds or thousands of false positives when applied to hundreds of thousands of candidates extracted from a large corpus. It is therefore advisable to control the family-wise error rate (FWER), i.e. the probability of generating at least one false positive in the entire set of \(m\) tests. Hardie argues for a Šidák correction, which reduces the significance level for individual tests to
\[
\alpha' = 1 - (1 - \alpha)^{\frac1m}
\]
while I prefer the more conservative, but simpler and numerically more
robust Bonferroni correction
\[ \alpha' = \frac{\alpha}{m} \]
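In practice the two corrections are nearly identical, with Bonferroni always slightly more conservative (i.e. a smaller \(\alpha'\)); a quick comparison for an assumed \(m = 100{,}000\) candidate terms:

```r
alpha <- 0.05
m <- 1e5  # number of candidate terms, i.e. of hypothesis tests (illustration value)
sidak <- 1 - (1 - alpha)^(1/m)
bonferroni <- alpha / m
c(sidak=sidak, bonferroni=bonferroni)  # both close to 5e-7
```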
G2() applies a Bonferroni correction by default if
alpha= is specified, based on the number of candidates
passed to the function, i.e. \(m =\)
length(f1). The adjustment can be disabled by passing
correct=FALSE.
Significance measures have repeatedly been criticised in corpus linguistics because they focus entirely on the amount of evidence against \(H_0\) rather than on how big the difference between \(\pi_1\) and \(\pi_2\) is. As a result, high-frequency candidate terms may obtain high keyness scores even if the difference in relative frequency between A and B is quite small, making them irrelevant from a linguistic perspective.
Several researchers have thus recommended the use of effect-size measures in keyword analysis. The simplest and most intuitive measure is relative risk, the ratio between the relative frequencies in the two samples:
\[ r = \frac{f_1}{n_1} / \frac{f_2}{n_2} \]
Hardie (2014: 45) proposes LogRatio \(= \log_2 r\) (i.e. logarithmic relative risk) as an intuitive keyness measure. The logarithm transforms scores into a more conveniently readable range and ensures that positive and negative keywords can be distinguished by the sign of the score.
Most other effect size measures that have been suggested in the context of keyword analysis are either equivalent to LogRatio – such as %DIFF \(= 100\cdot (r - 1)\) (Gabrielatos & Marchi 2012) – or closely related – such as odds ratio (Pojanapunya & Watson Todd 2018)
\[ \hat{\theta} = \frac{f_1}{n_1 - f_1} / \frac{f_2}{n_2 - f_2} \]
and \(\mathbf{\Delta P} = (r - 1) \frac{f_2}{n_2}\) (Gries 2013: 143f).
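These relationships are easy to check numerically; the counts below are invented illustration values:

```r
f1 <- 30; f2 <- 10; n1 <- 1e5; n2 <- 1e5  # invented counts
r <- (f1 / n1) / (f2 / n2)        # relative risk
LogRatio <- log2(r)
DIFF <- 100 * (r - 1)             # %DIFF, a monotone transformation of r
odds.ratio <- (f1 / (n1 - f1)) / (f2 / (n2 - f2))  # close to r for rare terms
deltaP <- (r - 1) * f2 / n2       # equals f1/n1 - f2/n2
c(r=r, LogRatio=LogRatio, DIFF=DIFF)  # 3, log2(3) = 1.585, 200
```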
One problematic aspect of LogRatio and most related effect-size measures is that they are not defined for the case \(f_2 = 0\). Hardie (2014: 45) recommends setting \(f_2 = \frac12\) in this case (and \(f_1 = \frac12\) vice versa). It is difficult to justify this particular value, though, and other researchers have proposed much smaller values, e.g. setting \(\frac{f_2}{n_2} = 10^{-18}\) (Gabrielatos & Marchi 2012).
[Update] A better solution is to use Walter’s (1975) adjusted estimator \[ \text{LogRatio} = \log_2 \frac{f_1 + \frac12}{n_1 + \frac12} - \log_2 \frac{f_2 + \frac12}{n_2 + \frac12} \] which is well-defined for \(f_1 = 0\) or \(f_2 = 0\) and provides a nearly unbiased estimate of \(\log_2 r\) (Gart & Nam 1988: 324).
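A minimal numerical illustration with invented counts shows that the plain point estimate degenerates for \(f_2 = 0\) while Walter’s adjusted estimator remains finite:

```r
f1 <- 20; f2 <- 0; n1 <- 1e4; n2 <- 1e4  # invented counts with f2 = 0
plain <- log2((f1 / n1) / (f2 / n2))     # Inf
walter <- log2((f1 + 0.5) / (n1 + 0.5)) - log2((f2 + 0.5) / (n2 + 0.5))
walter  # = log2(41), approx. 5.36
```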
[Update] My reference implementation of LogRatio defaults to Walter’s
adjustment but also provides Hardie’s version, which is implemented in
the CQPweb corpus platform (Hardie 2012). The LogRatio()
function takes the same arguments as G2() and can
optionally be combined with a \(G^2\)-based significance filter, setting
keyness scores to 0 for any candidates that are not statistically
significant.
LogRatio <- function (f1, f2, N1, N2, adjust=c("walter", "hardie"), alpha=NULL, correct=TRUE) {
stopifnot(length(f1) == length(f2))
stopifnot(all(f1 >= 1))
adjust <- match.arg(adjust)
## point estimate of LogRatio = log2(relative risk)
if (adjust == "hardie") {
f2.disc <- pmax(f2, 0.5) # questionable discounting of f2=0 according to Hardie (2014)
res <- log2((f1 / N1) / (f2.disc / N2))
} else {
res <- log2((f1 + 0.5) / (N1 + 0.5)) - log2((f2 + 0.5) / (N2 + 0.5))
}
## weed out non-significant items if alpha is specified
if (!is.null(alpha)) {
G2.scores <- G2(f1, f2, N1, N2, alpha=alpha, correct=correct)
res[G2.scores == 0] <- 0 # set LogRatio to zero if difference is not significant according to G2
}
res
}
Pure effect-size measures, such as those listed above, have an important drawback: they ignore statistical uncertainty in the form of sampling variation and are prone to yielding spuriously high scores for low-frequency candidate terms. The resulting low-frequency bias of effect-size measures can only partially be mitigated by adding a significance filter or a frequency threshold such as \(f_1 > k\).
Various heuristic variations of effect-size measures have been proposed to control the low-frequency bias more effectively. The most widely used example is SimpleMaths (Kilgarriff 2009), implemented by the SketchEngine platform:
\[ \mathop{\text{SM}} = \frac{10^6\cdot \frac{f_1}{n_1} + \lambda}{10^6\cdot \frac{f_2}{n_2} + \lambda} \]
where \(\lambda > 0\) is a user parameter that adjusts the frequency bias of SM (with large values resulting in a bias towards high-frequency candidates). The scaling factor of \(10^6\) was chosen so that \(\lambda = 1\) is a practical default value.
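The effect of \(\lambda\) can be illustrated with a simplified per-million version of the formula (the helper SM() and all frequencies below are invented for this sketch): a rare term exclusive to A outranks a frequent but less distinctive term for small \(\lambda\), and vice versa for large \(\lambda\):

```r
## SM on relative frequencies per million words (simplified sketch)
SM <- function (p1.pm, p2.pm, lambda) (p1.pm + lambda) / (p2.pm + lambda)
SM(5, 0, lambda=1)         # 6     <- rare term exclusive to A wins
SM(500, 100, lambda=1)     # 4.96  <- frequent, overrepresented term
SM(5, 0, lambda=1000)      # 1.005
SM(500, 100, lambda=1000)  # 1.364 <- now outranks the rare term
```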
My reference implementation SimpleMaths() takes the
additional argument lambda=. Following SketchEngine, it
does not offer a combination with a significance filter. As an
extension, log=TRUE returns logarithmic values with a more
reasonable scaling and sign corresponding (roughly) to positive
vs. negative keywords.
SimpleMaths <- function (f1, f2, N1, N2, lambda=1, log=FALSE) {
p1 <- f1 / N1
p2 <- f2 / N2
res <- (1e6 * p1 + lambda) / (1e6 * p2 + lambda)
if (log) log2(res) else res
}
Heuristic corrections such as SimpleMaths are unsatisfactory because they lack any mathematical justification, even if they might work reasonably well in practice. Due to the wide range of different applications of keyness measures and the paucity of thorough empirical evaluation studies, it seems particularly important to use a keyness measure with well-understood mathematical properties.
The mathematical reason behind the unreliability of effect-size measures is that they do not directly measure the quantity of interest, e.g. the relative risk or odds ratio (etc.) between populations A and B:
\[ \rho = \frac{\pi_1}{\pi_2} \qquad \theta = \frac{\pi_1}{1 - \pi_1} / \frac{\pi_2}{1 - \pi_2} \]
but rather estimate the quantity from random samples A and B. Such point estimates fail to take sampling variation into account, resulting in a high degree of uncertainty for low-frequency data.
A better approach is to determine confidence intervals of plausible values (e.g. \([\rho_{-}, \rho_{+}]\)) from the observed contingency table. These are obtained by inverting statistical hypothesis tests, collecting all values \(x\) for which the null hypothesis (\(H_0: \rho = x\)) cannot be rejected at a given significance level \(\alpha\). A conservative effect-size measure uses the boundary of the confidence interval that is closer to the null hypothesis (\(H_0: \rho = 1\)) as a keyness score. If \(\rho_{-} \leq 1 \leq \rho_{+}\), the candidate is not significant and should be filtered out.
I propose LRC (conservative LogRatio) as a best-practice keyness measure. LRC determines a conservative estimate for \(\log_2 \rho\) at a family-wise level, i.e. using a Bonferroni-corrected significance level \(\alpha'\).
LRC is based on an exact conditional Poisson test (Fay 2010: 55) and confidence interval, using the approximated Poisson sampling distribution of the observed contingency table with rates \(\lambda_1 = n_1 \pi_1\) and \(\lambda_2 = n_2 \pi_2\). Conditioning on the row sum \(f_1 + f_2\) results in a binomial distribution (Lehmann & Romano 2005: 125)
\[ \mathbb{P}(f_1 | f_1 + f_2) = \binom{f_1 + f_2}{f_1} \left( \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{f_1} \left( 1 - \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{f_2} \]
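The conditioning step can be verified numerically with base R: renormalising the joint Poisson probability of \((f_1, f_2)\) over all tables with the same sum \(f_1 + f_2\) reproduces the binomial probability above (rates chosen arbitrarily):

```r
lambda1 <- 3; lambda2 <- 7  # arbitrary Poisson rates
f1 <- 2; f2 <- 5; n <- f1 + f2
joint <- dpois(f1, lambda1) * dpois(f2, lambda2)
norm <- sum(dpois(0:n, lambda1) * dpois(n:0, lambda2))  # P(f1 + f2 = n)
cond <- joint / norm
all.equal(cond, dbinom(f1, n, lambda1 / (lambda1 + lambda2)))  # TRUE
```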
An exact Clopper-Pearson confidence interval for the binomial
parameter \(\tau = \frac{\lambda_1}{\lambda_1
+ \lambda_2}\) can be obtained efficiently via (the inverse of)
the incomplete Beta function. This corresponds to the “central” method
recommended by Fay (2010: 53). The code below – copied from the
corpora package – is equivalent to the implementation in
the built-in binom.test() function.
binom.confint <- function(k, n, conf.level=0.95, correct=FALSE,
alternative=c("two.sided", "less", "greater")) {
alternative <- match.arg(alternative)
stopifnot(all(k >= 0) && all(k <= n) && all(n >= 1))
stopifnot(all(conf.level >= 0) && all(conf.level <= 1))
## significance level for underlying hypothesis test (with optional Bonferroni correction)
alpha <- if (alternative == "two.sided") (1 - conf.level) / 2 else (1 - conf.level)
if (correct) alpha <- alpha / length(k) # Bonferroni correction
alpha <- rep_len(alpha, length(k)) # needs to be vector for safe.qbeta()
## Clopper-Pearson method: invert binomial test (using incomplete Beta function)
lower <- safe.qbeta(alpha, k, n - k + 1)
upper <- safe.qbeta(alpha, k + 1, n - k, lower.tail=FALSE)
switch(alternative,
two.sided = data.frame(lower = lower, upper = upper),
less = data.frame(lower = 0, upper = upper),
greater = data.frame(lower = lower, upper = 1))
}
## safely compute qbeta even for shape parameters alpha == 0 or beta == 0
safe.qbeta <- function (p, shape1, shape2, lower.tail=TRUE) {
stopifnot(length(p) == length(shape1) && length(p) == length(shape2))
is.0 <- shape1 <= 0
is.1 <- shape2 <= 0
ok <- !(is.0 | is.1)
x <- numeric(length(p))
x[ok] <- qbeta(p[ok], shape1[ok], shape2[ok], lower.tail=lower.tail) # shape parameters are valid
x[is.0 & !is.1] <- 0 # density concentrated at x = 0 (for alpha == 0)
x[is.1 & !is.0] <- 1 # density concentrated at x = 1 (for beta == 0)
x[is.0 & is.1] <- NA # shouldn't happen in our case (alpha == beta == 0)
x
}
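The equivalence with the built-in binom.test() can be checked directly for a regular case (counts invented), since R's implementation uses the same incomplete Beta inversion:

```r
k <- 13; n <- 100; conf.level <- 0.95  # invented counts
alpha <- (1 - conf.level) / 2
ci <- c(qbeta(alpha, k, n - k + 1),                    # lower bound
        qbeta(alpha, k + 1, n - k, lower.tail=FALSE))  # upper bound
all.equal(ci, as.vector(binom.test(k, n, conf.level=conf.level)$conf.int))  # TRUE
```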
The confidence interval \([\tau_{-}, \tau_{+}]\) is then transformed into a confidence interval \([\log_2 \rho_{-}, \log_2 \rho_{+}]\) using the equality
\[ \rho = \frac{n_2 \tau}{n_1 (1 - \tau)} \]
If this interval contains 0, the null hypothesis \(H_0: \rho = 1\) cannot be rejected and the score of the non-significant candidate is set to 0. Otherwise the end of the interval closer to 0 is returned as a keyness score. Positive values thus indicate significant positive keywords and negative values indicate significant negative keywords.
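The correctness of this back-transformation can be checked numerically: substituting \(\tau = \lambda_1 / (\lambda_1 + \lambda_2)\) with \(\lambda_1 = n_1 \pi_1\) and \(\lambda_2 = n_2 \pi_2\) recovers \(\rho = \pi_1 / \pi_2\) (all values arbitrary):

```r
n1 <- 1000; n2 <- 5000   # arbitrary sample sizes
p1 <- 0.02; p2 <- 0.003  # arbitrary occurrence probabilities
tau <- (n1 * p1) / (n1 * p1 + n2 * p2)
rho <- (n2 * tau) / (n1 * (1 - tau))
all.equal(rho, p1 / p2)  # TRUE
```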
My reference implementation LRC() below takes the same
arguments as G2() as well as an additional confidence level
\(1 - \alpha\). By default
(correct=TRUE) a Bonferroni correction is applied to obtain
the significance level \(\alpha' = \alpha
/ m\) for the confidence intervals of individual candidates. An
additional benefit of LRC is that it also works for \(f_1 = 0\) or \(f_2 = 0\).
LRC <- function (f1, f2, N1, N2, conf.level=.95, correct=TRUE) {
stopifnot(length(f1) == length(f2))
stopifnot(all(f1 + f2 >= 1))
## exact confidence interval from conditional Poisson test (two-sided)
tau <- binom.confint(f1, f1 + f2, conf.level=conf.level, correct=correct, alternative="two.sided")
ifelse(f1 / N1 >= f2 / N2,
pmax(log2( (N2 / N1) * tau$lower / (1 - tau$lower) ), 0), # p1 >= p2 -> use lower bound (clamped to >= 0)
pmin(log2( (N2 / N1) * tau$upper / (1 - tau$upper) ), 0)) # p1 < p2 -> use upper bound (clamped to <= 0)
}
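For a single candidate term, the same score can be obtained with the built-in binom.test() instead of binom.confint(); the helper lrc1() below is a standalone sketch without Bonferroni correction, which corresponds to LRC() applied to a single candidate (invented counts):

```r
## standalone LRC for a single candidate term (sketch using binom.test)
lrc1 <- function (f1, f2, N1, N2, conf.level=0.95) {
  tau <- binom.test(f1, f1 + f2, conf.level=conf.level)$conf.int
  if (f1 / N1 >= f2 / N2)
    max(log2((N2 / N1) * tau[1] / (1 - tau[1])), 0)  # lower bound, clamped to >= 0
  else
    min(log2((N2 / N1) * tau[2] / (1 - tau[2])), 0)  # upper bound, clamped to <= 0
}
lrc1(10, 0, 1000, 1000)  # approx. 1.16 (significant positive keyword)
lrc1(3, 2, 1000, 1000)   # 0 (not significant)
```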
NB: These variants of LRC are not recommended as best-practice measures and included here only for completeness and for comparison in the experiments below.
Positive LRC computes a one-sided confidence interval and always returns its lower bound \(\log_2 \rho_{-}\). It thus identifies only positive keywords (\(\pi_1 > \pi_2\)); candidates with negative scores are not significant (but can still be included in a ranking and possibly considered in a linguistic interpretation). However, one cannot determine from these scores whether a term occurs more frequently in sample A or in sample B.
PositiveLRC <- function (f1, f2, N1, N2, conf.level=.95, correct=TRUE) {
stopifnot(length(f1) == length(f2))
stopifnot(all(f1 + f2 >= 1))
## exact confidence interval from conditional Poisson test (one-sided)
tau <- binom.confint(f1, f1 + f2, conf.level=conf.level, correct=correct, alternative="greater")
log2( (N2 / N1) * tau$lower / (1 - tau$lower) )
}
The version of LRC currently implemented in CQPweb is based on asymptotic confidence intervals from a normal approximation rather than the exact conditional Poisson test. It has serious numerical and conceptual problems, many of which are inherited from Hardie’s adjustment \(f_2 = \frac12\) in the case of \(f_2 = 0\). [Update] These problems are alleviated somewhat if Walter’s (1975) point estimate and a similar nearly unbiased estimate \[ \hat{\sigma}(\log r) = \sqrt{ \frac{1}{f_1 + \frac12} + \frac{1}{f_2 + \frac12} - \frac{1}{n_1 + \frac12} - \frac{1}{n_2 + \frac12} } \] for its standard deviation are used (Gart & Nam 1988: 324).
ApproxLRC <- function (f1, f2, N1, N2, conf.level=.95, correct=TRUE, adjust=c("walter", "hardie")) {
stopifnot(length(f1) == length(f2))
stopifnot(all(f1 >= 1))
adjust <- match.arg(adjust)
alpha <- 1 - conf.level # desired significance level
if (correct) alpha <- alpha / length(f1) # Bonferroni correction
## approximate confidence interval based on asymptotic standard deviation of log(rr)
if (adjust == "hardie") {
f2.disc <- pmax(f2, 0.5) # with Hardie's (2014) questionable discounting
lrr <- log((f1 / N1) / (f2.disc / N2))
lrr.sd <- sqrt(1/f1 + 1/f2.disc - 1/N1 - 1/N2) # asymptotic s.d. of log(RR) according to Wikipedia
}
else {
lrr <- log((f1 + 0.5) / (N1 + 0.5)) - log((f2 + 0.5) / (N2 + 0.5))
lrr.sd <- sqrt(1/(f1 + 0.5) + 1/(f2 + 0.5) - 1/(N1 + 0.5) - 1/(N2 + 0.5))
}
z.factor <- -qnorm(alpha / 2) # number of s.d. for asymptotic two-sided conf.int
lrr <- ifelse(lrr >= 0,
pmax(lrr - z.factor * lrr.sd, 0), # log(RR) >= 0 -> use lower bound (clamped to >= 0)
pmin(lrr + z.factor * lrr.sd, 0)) # log(RR) < 0 -> use upper bound (clamped to <= 0)
lrr / log(2) # adjust to log2 units
}
As an illustrative example and for visualisations, we provide a data set of keyword candidates from the evaluation study of Evert et al. (2018) on discourses around multi-resistant pathogens (MRO) in Germany. In this study, an earlier version of LRC achieved slightly better evaluation results than LogRatio and log-likelihood, providing a balance between the low-frequency bias of the former and the high-frequency bias of the latter.
The tabular file sample_data.tsv contains document
frequencies for 20,109 candidate terms in the target corpus of online
articles on MRO (\(f_1\)) and in a
newspaper reference corpus (\(f_2\)). The data set includes all terms with
\(f_1 \geq 2\) in the target corpus.
The sample sizes – i.e. the number of texts in the target (\(n_1\)) and reference (\(n_2\)) corpus – have to be provided
separately.
Keywords <- read.delim("sample_data.tsv", quote="", stringsAsFactors=FALSE, fileEncoding="UTF-8")
n1 <- 1177 # target corpus (document count)
n2 <- 341805 # reference corpus
We can now apply the different keyness measures to the example data set. Note that the default discount \(\lambda = 1\) of SimpleMaths is designed for token frequencies per million words, but our relative document frequencies are often considerably higher. Assuming one occurrence per document, relative frequency should increase by a factor corresponding to the average document size (which is very roughly 500 tokens in this data set), so \(\lambda = 100\) seems a reasonable choice.
Keywords <- transform(
Keywords,
G2 = G2(f1, f2, n1, n2),
LogRatioHardie = LogRatio(f1, f2, n1, n2, adjust="hardie"),
LogRatio = LogRatio(f1, f2, n1, n2),
LogRatio.G2 = LogRatio(f1, f2, n1, n2, alpha=.05),
SM = SimpleMaths(f1, f2, n1, n2, lambda=100),
LRC = LRC(f1, f2, n1, n2),
LRC.positive = PositiveLRC(f1, f2, n1, n2),
LRC.hardie = ApproxLRC(f1, f2, n1, n2, adjust="hardie"),
LRC.normal = ApproxLRC(f1, f2, n1, n2))
Among the 20,109 candidates there are 6,143 significant positive keywords according to a \(G^2\) test with Bonferroni correction, and 4,722 according to LRC.
To gain a first impression of the keyness measures, we can look at
the top-20 keywords for each measure. Note that this superficial
impression does not accurately reflect the usefulness of the measures
for an actual corpus study that will take a much larger number of
top-ranked candidates into account. The show.top.kw()
function always shows LRC scores and corresponding ranks for
comparison.
show.top.kw <- function (df, col, n=20) {
idx <- order(df[[col]], decreasing=TRUE)[seq_len(min(n, nrow(df)))]
top <- df[idx, c("lemma", "f1", "f2", col)]
top$rank <- seq_along(idx)
if (col != "LRC") {
df$r.LRC <- rank(-df$LRC, ties.method="first")
top <- cbind(top, df[idx, c("LRC", "r.LRC")])
}
top
}
show.top.kw(Keywords, "G2")
## lemma f1 f2 G2 rank LRC r.LRC
## 663 Antibiotikum 944 367 10290.405 1 9.132239 59
## 5935 Keim 936 805 9325.546 2 8.071781 143
## 17607 resistent 778 244 8397.957 3 9.365925 48
## 5466 Infektion 780 699 7503.595 4 7.984664 152
## 1155 Bakterie 748 528 7388.316 5 8.298289 121
## 16949 multiresistent 624 70 7039.846 6 10.530982 11
## 13204 ander 574 18 6693.887 7 11.757507 1
## 3233 Erreger 665 473 6480.059 8 8.264729 123
## 8613 Patient 819 5370 5253.111 9 5.207546 700
## 9451 Resistenz 482 130 5069.018 10 9.420591 44
## 7185 MRSA 411 13 4711.916 11 11.530551 2
## 6425 Krankenhaus 733 7074 4073.472 12 4.640064 886
## 5170 Hygiene 420 403 3799.236 13 7.763843 177
## 1207 Bakterium 342 117 3471.324 14 9.019495 64
## 6138 Klinik 575 4169 3378.359 15 5.012148 803
## 16194 infizieren 392 754 3127.711 16 6.805293 326
## 6467 Krankenhauskeim 269 17 2990.714 17 10.672729 7
## 10645 Staphylococcus 258 11 2897.350 18 10.961830 4
## 13556 aureus 257 11 2885.592 19 10.955959 5
## 6462 Krankenhaushygiene 260 28 2829.297 20 10.164283 18
show.top.kw(Keywords, "LogRatio")
## lemma f1 f2 LogRatio rank LRC r.LRC
## 2274 Colistin 86 0 15.61593 1 10.727321 6
## 19998 über 79 0 15.49419 2 10.594453 9
## 7869 Multiresistente 78 0 15.47593 3 10.574436 10
## 14416 desto 72 0 15.36121 4 10.448163 12
## 11697 VRE 60 0 15.10017 5 10.156939 20
## 618 Antibiotika-resistent 55 0 14.97572 6 10.015960 23
## 16670 lieber 51 0 14.86781 7 9.892440 26
## 14183 bis 43 0 14.62425 8 9.608762 34
## 17174 of 42 0 14.59070 9 9.569094 36
## 10882 Superbakterie 41 0 14.55634 10 9.528322 37
## 2332 DGKH 38 0 14.44809 11 9.398728 46
## 16811 meisten 35 0 14.33105 12 9.256585 49
## 4573 Grünen 34 0 14.28983 13 9.205982 51
## 10883 Superbakterium 34 0 14.28983 14 9.205982 52
## 2749 ESBL-bildend 33 0 14.24739 15 9.153581 54
## 11711 Vancomycin-resistent 33 0 14.24739 16 9.153581 55
## 7202 MRSA-Fall 31 0 14.15858 17 9.042852 63
## 10532 Spitalhygiene 30 0 14.11204 18 8.984218 68
## 10678 Stattdessen 30 0 14.11204 19 8.984218 69
## 7212 MRSA-Patient 29 0 14.06395 20 8.923170 75
show.top.kw(Keywords, "LogRatio.G2")
## lemma f1 f2 LogRatio.G2 rank LRC r.LRC
## 2274 Colistin 86 0 15.61593 1 10.727321 6
## 19998 über 79 0 15.49419 2 10.594453 9
## 7869 Multiresistente 78 0 15.47593 3 10.574436 10
## 14416 desto 72 0 15.36121 4 10.448163 12
## 11697 VRE 60 0 15.10017 5 10.156939 20
## 618 Antibiotika-resistent 55 0 14.97572 6 10.015960 23
## 16670 lieber 51 0 14.86781 7 9.892440 26
## 14183 bis 43 0 14.62425 8 9.608762 34
## 17174 of 42 0 14.59070 9 9.569094 36
## 10882 Superbakterie 41 0 14.55634 10 9.528322 37
## 2332 DGKH 38 0 14.44809 11 9.398728 46
## 16811 meisten 35 0 14.33105 12 9.256585 49
## 4573 Grünen 34 0 14.28983 13 9.205982 51
## 10883 Superbakterium 34 0 14.28983 14 9.205982 52
## 2749 ESBL-bildend 33 0 14.24739 15 9.153581 54
## 11711 Vancomycin-resistent 33 0 14.24739 16 9.153581 55
## 7202 MRSA-Fall 31 0 14.15858 17 9.042852 63
## 10532 Spitalhygiene 30 0 14.11204 18 8.984218 68
## 10678 Stattdessen 30 0 14.11204 19 8.984218 69
## 7212 MRSA-Patient 29 0 14.06395 20 8.923170 75
show.top.kw(Keywords, "SM")
## lemma f1 f2 SM rank LRC r.LRC
## 13204 ander 574 18 3195.1750 1 11.757507 1
## 7185 MRSA 411 13 2530.4956 2 11.530551 2
## 16949 multiresistent 624 70 1739.7306 3 10.530982 11
## 10645 Staphylococcus 258 11 1659.0853 4 10.961830 4
## 13556 aureus 257 11 1652.6576 5 10.955959 5
## 6467 Krankenhauskeim 269 17 1527.0023 6 10.672729 7
## 9440 Reserveantibiotikum 162 3 1266.2433 7 11.041688 3
## 6462 Krankenhaushygiene 260 28 1214.8360 8 10.164283 18
## 645 Antibiotikaresistenz 180 15 1063.5683 9 10.163151 19
## 7207 MRSA-Keim 134 4 1020.1087 10 10.610060 8
## 2736 ESBL 118 4 898.4115 11 10.412481 14
## 1189 Bakterienstamm 143 14 862.6289 12 9.861341 27
## 9451 Resistenz 482 130 852.7730 13 9.420591 44
## 12610 Wundinfektion 141 16 816.6735 14 9.725766 32
## 17607 resistent 778 244 812.3077 15 9.365925 48
## 211 Acinetobacter 112 6 810.3279 16 10.096282 21
## 596 Antibiotika-Resistenz 117 8 806.3299 17 9.973964 25
## 13373 antibiotikaresistent 132 14 796.3274 18 9.734820 31
## 17344 pneumoniae 98 2 787.5438 19 10.431809 13
## 8669 Penicillin 152 23 772.5627 20 9.516115 40
show.top.kw(Keywords, "LRC")
## lemma f1 f2 LRC rank
## 13204 ander 574 18 11.75751 1
## 7185 MRSA 411 13 11.53055 2
## 9440 Reserveantibiotikum 162 3 11.04169 3
## 10645 Staphylococcus 258 11 10.96183 4
## 13556 aureus 257 11 10.95596 5
## 2274 Colistin 86 0 10.72732 6
## 6467 Krankenhauskeim 269 17 10.67273 7
## 7207 MRSA-Keim 134 4 10.61006 8
## 19998 über 79 0 10.59445 9
## 7869 Multiresistente 78 0 10.57444 10
## 16949 multiresistent 624 70 10.53098 11
## 14416 desto 72 0 10.44816 12
## 17344 pneumoniae 98 2 10.43181 13
## 2736 ESBL 118 4 10.41248 14
## 9464 Resistenzentwicklung 78 1 10.28177 15
## 15752 gramnegativ 78 1 10.28177 16
## 2113 Carbapeneme 89 2 10.28002 17
## 6462 Krankenhaushygiene 260 28 10.16428 18
## 645 Antibiotikaresistenz 180 15 10.16315 19
## 11697 VRE 60 0 10.15694 20
We also compute rankings for all keyness measures (but in a new data frame Rankings to keep the original free of clutter). We omit LogRatio.G2 because its ranks are identical to LogRatio for all significant positive keywords; the same holds for LRC.positive.
my.rank <- function (x) rank(-x, ties.method="first")
Rankings <- transform(
Keywords,
r.G2 = my.rank(G2),
r.LogRatioHardie = my.rank(LogRatioHardie),
r.LogRatio = my.rank(LogRatio),
r.SM = my.rank(SM),
r.LRC = my.rank(LRC),
r.LRC.hardie = my.rank(LRC.hardie),
r.LRC.normal = my.rank(LRC.normal))
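Note that ties.method="first" breaks ties by position, so every candidate receives a distinct rank even when scores are equal:

```r
my.rank <- function (x) rank(-x, ties.method="first")
my.rank(c(3.2, 1.5, 3.2, 7.0))  # 2 4 3 1
```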
As a further indication of how the keyness measures differ, we compare lists of top-250 keywords from each measure. The overlap between such lists can be visualised with Venn diagrams:
tmp <- with(Rankings, data.frame(
G2 = r.G2 <= 250, LogRatio = r.LogRatio <= 250, SM = r.SM <= 250,
LRC = r.LRC <= 250, LRC.norm = r.LRC.normal <= 250, LRC.hardie = r.LRC.hardie <= 250))
tmp1 <- subset(tmp, G2 | LogRatio | SM | LRC, qw("G2 SM LRC LogRatio"))
venn(tmp1, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
tmp2 <- subset(tmp, LRC | LRC.hardie | LRC.norm, qw("LRC LRC.norm LRC.hardie"))
venn(tmp2, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
if (export.plots) {
cairo_pdf(file="img/venn1_top250.pdf", width=8, height=8)
venn(tmp1, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
invisible(dev.off())
cairo_pdf(file="img/venn2_top250.pdf", width=8, height=8)
venn(tmp2, ilcs=1.5, sncs=1.7, zcolor=muted.pal, box=FALSE)
invisible(dev.off())
}
A histogram of \(f_1\) for the top-250 candidate sets reveals that the limited overlap is primarily explained by differences in the frequency profiles. Note the logarithmic scale on the \(f_1\)-axis. LRC has the most balanced frequency distribution, covering a plausible range of mid-frequency terms, but excluding overly general high-frequency terms.
tmp <- melt(Rankings, id=qw("lemma f1 f2"), measure=qw("r.G2 r.SM r.LRC r.LogRatio"),
variable.name="measure", value.name="rank")
tmp <- subset(tmp, rank <= 250)
hist.pal <- structure(muted.pal[1:4], names=qw("r.G2 r.SM r.LRC r.LogRatio"))
plt <- ggplot(tmp, aes(x=f1, after_stat(count), fill=measure)) +
scale_x_log10(limits=c(4, 2e3)) +
geom_density(alpha=.4) +
scale_fill_manual(values=hist.pal, labels=expression(G^2, SM, LRC, LogRatio)) +
xlab("frequency in target corpus") + ylab("density of keywords") +
theme_grey(base_size=15) + theme(legend.position="top")
print(plt)
if (export.plots) {
cairo_pdf(file="img/histogram_f1_top250.pdf", width=8, height=8)
print(plt)
invisible(dev.off())
}
tmp <- melt(Rankings, id=qw("lemma f1 f2"), measure=qw("r.G2 r.SM r.LRC r.LogRatio"),
variable.name="measure", value.name="rank")
tmp <- subset(tmp, rank <= 250)
hist.pal <- structure(muted.pal[1:4], names=qw("r.G2 r.SM r.LRC r.LogRatio"))
mk.plt <- function (x) {
ggplot(x, aes(x=f1, after_stat(count), fill=measure)) +
scale_x_log10(limits=c(4, 2e3)) +
geom_density(alpha=.4) +
scale_fill_manual(values=hist.pal, labels=expression(G^2, SM, LRC, LogRatio)) +
xlab("frequency in target corpus") + ylab("density of keywords") +
theme_grey(base_size=15) + theme(legend.position="top")
}
if (export.plots) {
cairo_pdf(file="img/histogram_f1_top250_G2.pdf", width=12, height=4)
print(mk.plt(subset(tmp, measure %in% qw("r.G2"))))
invisible(dev.off())
cairo_pdf(file="img/histogram_f1_top250_G2_LR.pdf", width=12, height=4)
print(mk.plt(subset(tmp, measure %in% qw("r.G2 r.LogRatio"))))
invisible(dev.off())
cairo_pdf(file="img/histogram_f1_top250_G2_LR_LRC.pdf", width=12, height=4)
print(mk.plt(subset(tmp, measure %in% qw("r.G2 r.LogRatio r.LRC"))))
invisible(dev.off())
}
For a given pair of samples A and B, the sample sizes \(n_1\) and \(n_2\) are fixed for all candidates in the data set, and keyness scores depend only on the pair of observed frequencies \((f_1, f_2)\). Interpreting these pairs as coordinates in a two-dimensional plane, keyness measures can be visualised as topographic maps over this plane. Such plots can give further insights into the different mathematical properties of keyness measures.
The function topographic.map(LogRatio, ...) visualises
the keyness measure implemented by LogRatio(), with
detailed control over the frequency ranges to be plotted and many other
graphical aspects. If z.log=TRUE, keyness scores are
visualised on a logarithmic scale (supporting both positive and negative
scores). A data frame with candidate terms and their frequencies can be
passed in the optional argument data= to be overlaid as a
scatter plot.
signed.log <- function (x) sign(x) * log10(abs(x) + 1)
signed.exp <- function (x) sign(x) * (10^abs(x) - 1)
topographic.map <- function (MEASURE, xlim=c(0, 1e5), ylim=c(1e2, 1e6), zlim=NULL,
N1=10e6, N2=10e6, grid=400, nlevels=20, z.log=FALSE, main="",
data=NULL, pch=20, cex=.5, col="#00552288", jitter=.05, ...) {
xlim.log <- signed.log(xlim)
ylim.log <- signed.log(ylim)
x.log <- seq(xlim.log[1], xlim.log[2], length.out=grid)
y.log <- seq(ylim.log[1], ylim.log[2], length.out=grid)
x <- signed.exp(x.log)
y <- signed.exp(y.log)
z.mat <- outer(y, x, MEASURE, N1=N1, N2=N2, ...) # evaluate keyness measure on grid
if (z.log) z.mat <- signed.log(z.mat)
col.range <- c(hsv(h=2/3, v=1, s=(100:1)/100), rgb(1,1,1), hsv(h=0, v=1, s=(1:100/100)))
if (is.null(zlim)) zlim <- c(-1,1) * max(abs(z.mat))
image(x.log, y.log, t(z.mat), xlim=xlim.log, ylim=ylim.log, zlim=zlim,
col=col.range, xlab=expression(log[10](f[2]+1)), ylab=expression(log[10](f[1]+1)), main=main)
contour(x.log, y.log, t(z.mat), nlevels=nlevels, labcex=1, add=TRUE)
if (!is.null(data)) {
set.seed(42)
points(jitter(signed.log(data$f2), amount=jitter), jitter(signed.log(data$f1), amount=jitter),
pch=pch, cex=cex, col=col)
}
}
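As a quick sanity check on the axis transformation, signed.exp() is the exact inverse of signed.log() for positive, zero and negative inputs. The snippet re-states the two helper functions from above so that it is self-contained:

```r
# signed base-10 log scale and its inverse (as defined above)
signed.log <- function (x) sign(x) * log10(abs(x) + 1)
signed.exp <- function (x) sign(x) * (10^abs(x) - 1)

x <- c(-999, -9, 0, 9, 999)
signed.log(x)                                       # -3 -1  0  1  3
stopifnot(all.equal(signed.exp(signed.log(x)), x))  # exact round trip
```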
We first show topographic maps for two balanced samples of \(n_1 = n_2 = 100 \text{M}\) tokens each, assuming that we’re testing \(m = 100 \text{k}\) candidate terms (for Bonferroni correction). Note that we have to pass the adjusted significance level \(\alpha'\) manually because there is no underlying data set.
conf.adjust <- 1 - .05 / 1e5 # Bonferroni-adjusted confidence level of 95% for m=100,000 candidates
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=100e6, N2=100e6, ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")
my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")
if (export.plots) {
cairo_pdf(file="img/topomap_100M_100M.pdf", width=12, height=8)
par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
invisible(dev.off())
}
Next, we show the same maps for two smaller samples of \(n_1 = n_2 = 1 \text{M}\) tokens, assuming that we’re testing \(m = 5 \text{k}\) candidate terms.
conf.adjust <- 1 - .01 / 5e3 # Bonferroni-adjusted confidence level of 99% for m=5,000 candidates
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=1e6, N2=1e6, xlim=c(0, 1e5), ylim=c(5, 1e5), ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")
my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")
if (export.plots) {
cairo_pdf(file="img/topomap_1M_1M.pdf", width=12, height=8)
par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
invisible(dev.off())
}
We also consider the unbalanced setting of \(n_1 = 1 \text{M}\) vs. \(n_2 = 100 \text{M}\) tokens, assuming that we’re testing \(m = 5 \text{k}\) candidate terms.
conf.adjust <- 1 - .01 / 5e3 # Bonferroni-adjusted confidence level of 99% for m=5,000 candidates
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=1e6, N2=100e6, xlim=c(0, 1e6), ylim=c(5, 1e5), ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")
my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")
if (export.plots) {
cairo_pdf(file="img/topomap_1M_100M.pdf", width=12, height=8)
par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
invisible(dev.off())
}
Finally, we visualise the realistic setting of the example data set, which is based on document frequencies. As a point of comparison, we overlay the candidate terms that appear in any of the top-250 keyword lists compared above.
tmp <- Rankings[, grepl("^r[.]", colnames(Rankings), perl=TRUE)]
idx <- apply(tmp, 1, min) <= 250
ShowCand <- Rankings[idx, qw("lemma f1 f2")]
conf.adjust <- 1 - .01 / nrow(Rankings) # Bonferroni-adjusted confidence level of 99% for all candidates in the data set
my.topomap <- function (MEASURE, ...) topographic.map(MEASURE, N1=n1, N2=n2, xlim=c(0, 100e3), ylim=c(1, 1000), data=ShowCand, cex=1, col="#006622", ...)
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, adjust="hardie", main="LogRatio (Hardie)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), correct=FALSE, main="LogRatio + LLR filter")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(PositiveLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (positive)")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
my.topomap(ApproxLRC, adjust="hardie", conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx. by Hardie)")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(SimpleMaths, lambda=1e-2, log=TRUE, main="log SimpleMaths (λ = .01)")
my.topomap(SimpleMaths, lambda=1e+2, log=TRUE, main="log SimpleMaths (λ = 100)")
if (export.plots) {
cairo_pdf(file="img/topomap_Keywords.pdf", width=12, height=8)
par(mar=c(4, 4, 1, 1)+.2, mfrow=c(2,3))
my.topomap(G2, z.log=TRUE, main="Log-Likelihood (LLR)")
my.topomap(LogRatio, main="LogRatio")
my.topomap(LogRatio, alpha=(1 - conf.adjust), main="LogRatio + LLR filter")
my.topomap(SimpleMaths, log=TRUE, main="log SimpleMaths (λ = 1)")
my.topomap(LRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC")
my.topomap(ApproxLRC, conf.level=conf.adjust, correct=FALSE, nlevels=15, main="LRC (normal approx.)")
invisible(dev.off())
}